⛽ Fuel Efficiency Prediction

Using the classic Auto MPG dataset, we will predict the fuel efficiency of late-1970s and early-1980s automobiles from features such as cylinders, displacement, horsepower, and weight.

It is a very small dataset with only a few features. We will first build a linear model and a neural network, evaluate their performance, track our experiment runs and inspect the logs using MLflow, then apply TPOT to see how it can search over many ML model architectures, and finally explain the model with SHAP.

📚 Learning Objectives

By the end of this session, you will be able to

Note: State of Data Science and Machine Learning 2021 by Kaggle shows that the most commonly used algorithms were linear and logistic regression, followed closely by decision trees, random forests, and gradient boosting machines (are you surprised?). Multilayer perceptrons, or artificial neural networks, are not yet popular tools for tabular/structured data; for the technical reasons, see the papers Deep Neural Networks and Tabular Data: A Survey and Tabular Data: Deep Learning is Not All You Need. For this assignment, the main purpose is for you to get familiar with the basic building blocks of neural networks before we dive into more specialized architectures.

IMPORTANT

You only need to run the following cells if you're completing the assignment in Google Colab. If you've already installed these libraries locally, you can skip them.

Task 1 - Data: Auto MPG dataset

  1. Start MLflow's automatic logging using library-specific autolog calls for tensorflow: logging metrics, parameters, and models without the need for explicit log statements.

    We will get into more details using MLflow after completing our experiment.

  1. The dataset is available from the UCI Machine Learning Repository. First download and import the dataset using pandas:
  1. The dataset contains a few unknown values; drop those rows to keep this initial tutorial simple. Use pd.DataFrame.dropna():
  1. The "Origin" column is categorical, not numeric. So the next step is to one-hot encode the values in the column with pd.get_dummies.
  1. Split the data into training and test sets. To reduce the module importing overhead, use pd.DataFrame.sample() instead of sklearn.model_selection.train_test_split() to set 80% of the data aside as train_dataset, and set the random state to 0 for reproducibility.

    Then use pd.DataFrame.drop() to obtain the test_dataset.

  1. Review the pairwise relationships of a few pairs of columns from the training set.

    The top row suggests that the fuel efficiency (MPG) is a function of all the other parameters. The other rows indicate they are functions of each other.

Let's also check the overall statistics. Note how each feature covers a very different range:

  1. Split features from labels. That is, separate the target value (also called the "label") from the features. The label is the value that you will train the model to predict.
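The data-preparation steps of this task can be sketched on a tiny in-memory table. This is only an illustration: the rows are made up, the column subset and the string-valued "Origin" are assumptions, and the real notebook reads the full dataset from the UCI repository instead.

```python
import pandas as pd

# Hypothetical stand-in for the Auto MPG table (values are invented).
raw_dataset = pd.DataFrame({
    "MPG": [18.0, 15.0, 36.0, 27.0, 24.0, 31.0, 22.0, 29.0, 16.0, 33.0],
    "Horsepower": [130.0, 165.0, None, 88.0, 95.0, 65.0, 110.0, 70.0, 150.0, 60.0],
    "Weight": [3504, 3693, 2130, 2130, 2372, 1773, 2833, 1990, 3433, 1800],
    "Origin": ["USA", "USA", "Japan", "Japan", "Europe",
               "Japan", "USA", "Europe", "USA", "Japan"],
})

# Drop rows that contain unknown values.
dataset = raw_dataset.dropna()

# One-hot encode the categorical column; empty prefixes keep the plain names.
dataset = pd.get_dummies(dataset, columns=["Origin"], prefix="", prefix_sep="")

# 80% of the rows go to training; the remaining rows form the test set.
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

# Separate the label (MPG) from the features.
train_features = train_dataset.copy()
test_features = test_dataset.copy()
train_labels = train_features.pop("MPG")
test_labels = test_features.pop("MPG")

print(train_features.shape, test_features.shape)
```

Dropping the sampled index from the full table guarantees that no row appears in both sets, which is what makes this a valid replacement for train_test_split here.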

Task 2 - Normalization Layer

It is good practice to normalize features that use different scales and ranges. Although a model might converge without feature normalization, normalization makes training much more stable.

Similar to scikit-learn, tensorflow.keras offers a list of preprocessing layers so that you can build and export models that are truly end-to-end.

  1. The Normalization layer (tf.keras.layers.Normalization) is a clean and simple way to add feature normalization into your model. The first step is to create the layer:
  1. Then, fit the state of the preprocessing layer to the data by calling Normalization.adapt:

We can see the feature mean and variance are stored in the layer:

When the layer is called, it returns the input data, with each feature independently normalized:
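To make the stored mean and variance concrete, here is a minimal NumPy sketch of the per-feature arithmetic the layer is documented to apply, (x - mean) / sqrt(variance). The small array is made up, and the code stands in for the layer rather than calling TensorFlow.

```python
import numpy as np

# Hypothetical feature matrix: columns could be cylinders and weight.
features = np.array([
    [4.0, 2000.0],
    [6.0, 3000.0],
    [8.0, 4000.0],
])

# "Adapting" amounts to computing per-feature statistics over the data.
mean = features.mean(axis=0)
variance = features.var(axis=0)  # population variance, one value per column

# "Calling" the layer standardizes each feature independently.
normalized = (features - mean) / np.sqrt(variance)

print("mean:", mean)
print("variance:", variance)
print("normalized:\n", normalized)
```

After this transformation every column has mean 0 and unit variance, so features measured in the thousands no longer dominate features measured in single digits.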

Task 3 - Linear Regression 📈

Before building a deep neural network model, start with linear regression using all the features.

Training a model with tf.keras typically starts by defining the model architecture. Use a tf.keras.Sequential model, which represents a sequence of steps.

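There are two steps in this multivariate linear regression model: normalize the input features with the normalizer layer, then apply a linear transformation ($y = mx + b$) that produces one output using a tf.keras.layers.Dense layer.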

The number of inputs can either be set by the input_shape argument, or automatically when the model is run for the first time.
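The shapes this task checks for follow directly from the linear algebra. As a sketch, with random numbers standing in for normalized rows and for untrained weights (both are assumptions, not the notebook's values), a batch of 10 rows with 9 features multiplied by a (9, 1) kernel gives a (10, 1) output:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 9
batch = rng.normal(size=(10, n_features))   # stands in for 10 normalized rows

kernel = rng.normal(size=(n_features, 1))   # the m in y = mx + b
bias = np.zeros(1)                          # the b in y = mx + b

# A single dense unit with no activation is exactly this expression.
predictions = batch @ kernel + bias

print(kernel.shape)       # (9, 1)
print(predictions.shape)  # (10, 1)
```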

  1. Build the Keras Sequential model:
  1. This model will predict 'MPG' from all features in train_features. Run the untrained model on the first 10 data points / rows using Model.predict(). The output won't be good, but notice that it has the expected shape of (10, 1):
  1. When you call the model, its weight matrices will be built—check that the kernel weights (the $m$ in $y = mx + b$) have a shape of (9, 1):
  1. Once the model is built, configure the training procedure using the Keras Model.compile method. The most important arguments to compile are the loss and the optimizer: the loss defines what will be optimized, and the optimizer defines how (here, tf.keras.optimizers.Adam).

    Here's a list of built-in loss functions in tf.keras.losses. For regression tasks, common loss functions include mean squared error (MSE) and mean absolute error (MAE). Here, MAE is preferred because it makes the model more robust to outliers.

    For optimizers, gradient descent (check the video Gradient Descent, Step-by-Step for a refresher) is the preferred way to optimize neural networks and many other machine learning algorithms. Read An overview of gradient descent optimization algorithms for several popular variants. Here, we use the popular tf.keras.optimizers.Adam and set the learning rate to 0.1 for faster learning.

  1. Use Keras Model.fit to run the training for 100 epochs. Set verbose to 0 to suppress logging, and keep 20% of the data for validation:
  1. Visualize the model's training progress using the stats stored in the history object:

Use the provided plot_loss(history) to visualize how the loss evolves for the training and validation sets.

  1. Collect the results on the test set for later using Model.evaluate()
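The choice of MAE over MSE made in this task can be illustrated with plain Python. In this sketch the labels and predictions are invented; the point is only how a single badly missed prediction affects each loss.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [20.0, 25.0, 30.0, 35.0]
y_pred_clean = [21.0, 24.0, 31.0, 34.0]    # every prediction is off by 1
y_pred_outlier = [21.0, 24.0, 31.0, 15.0]  # the last one is off by 20

mae_clean = mean_absolute_error(y_true, y_pred_clean)
mse_clean = mean_squared_error(y_true, y_pred_clean)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)
mse_outlier = mean_squared_error(y_true, y_pred_outlier)

print(mae_clean, mae_outlier)  # 1.0 5.75
print(mse_clean, mse_outlier)  # 1.0 100.75
```

One outlier multiplies the MAE by about 6 but the MSE by about 100, so a model trained on MSE is pulled much harder toward fitting that single point.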

Task 4 - Regression with a Deep Neural Network (DNN)

You just implemented a linear model for multiple inputs. Now, you are ready to implement multiple-input DNN models.

The code is very similar except the model is expanded to include some "hidden" non-linear layers. The name "hidden" here just means not directly connected to the inputs or outputs.

  1. Include the model and compile method in the build_and_compile_model function below.
  1. Create a DNN model with normalizer (defined earlier) as the normalization layer:
  1. Inspect the model using Model.summary(). This model has quite a few more trainable parameters than the linear model:
  1. Train the model with Keras Model.fit:
  1. Visualize the model's training progress using the stats stored in the history object.

Do you think the DNN model is overfitting? What gives it away?

If the training loss keeps decreasing while the validation loss plateaus or starts to rise, the model is overfitting; the widening gap between the two curves gives it away. If both losses keep improving together, the model is not yet overfitting.

  1. Let's save the results for later comparison.
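One common way to read the loss curves, offered here as a rough sketch rather than a definitive test, is to find the epoch with the best validation loss and compare the train/validation gap there with the gap at the end of training. The dictionary below mimics the "loss" and "val_loss" keys of a Keras history, but its numbers are invented.

```python
def best_validation_epoch(val_loss):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)


# Hypothetical curves: training keeps improving, validation turns around.
history = {
    "loss":     [5.0, 3.5, 2.8, 2.4, 2.1, 1.9, 1.7, 1.5],
    "val_loss": [5.2, 3.8, 3.1, 2.9, 2.9, 3.0, 3.2, 3.4],
}

best = best_validation_epoch(history["val_loss"])
gap_at_best = history["val_loss"][best] - history["loss"][best]
gap_at_end = history["val_loss"][-1] - history["loss"][-1]

# Heuristic: validation got worse after its best epoch while the gap widened.
looks_overfit = (
    history["val_loss"][-1] > history["val_loss"][best]
    and gap_at_end > gap_at_best
)

print("best epoch:", best)
print("gap at best epoch:", round(gap_at_best, 2))
print("gap at last epoch:", round(gap_at_end, 2))
print("looks overfit:", looks_overfit)
```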

Task 5 - Make Predictions 🔮

  1. Since both models have been trained, we can review their test set performance:

These results match the validation error observed during training.

  1. We can now make predictions with the dnn_model on the test set using Keras Model.predict and review the loss. Use .flatten().
  1. It appears that the model predicts reasonably well. Now, check the error distribution:
  1. Save it for later use with Model.save:
  1. Reload the model with tf.keras.models.load_model; it gives identical output:
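The reason for calling .flatten() on the predictions can be shown with NumPy alone. In this sketch the arrays are made up; the (n, 1) array stands in for what Model.predict returns for a single-output model.

```python
import numpy as np

test_labels = np.array([20.0, 25.0, 30.0, 35.0])               # shape (4,)
raw_predictions = np.array([[21.0], [24.0], [32.0], [33.0]])   # shape (4, 1)

# Flattening aligns the predictions with the 1-D labels.
test_predictions = raw_predictions.flatten()
errors = test_predictions - test_labels
mae = np.abs(errors).mean()

# Without flattening, broadcasting silently builds a 4 x 4 matrix instead.
broadcast_errors = raw_predictions - test_labels

print(errors.shape)            # (4,)
print(broadcast_errors.shape)  # (4, 4)
print(mae)                     # 1.5
```

The second subtraction raises no error, which is why the mismatch is easy to miss: any statistic computed from it would mix every prediction with every label.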

Task 6 - Nonlinearity

We mentioned that the relu activation function introduces non-linearity; let's visualize it. Since there are six numerical features and one categorical feature, it is impossible to plot all the dimensions on a 2D plot; we need to simplify/isolate it.

Note: in this task, code is provided; the focus is on understanding.
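Before looking at the trained models, it may help to see how relu bends a line. The sketch below is a hand-built network with one input, two relu hidden units, and a linear output; its weights are picked by hand for illustration and have nothing to do with the trained DNN.

```python
def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def tiny_network(x):
    """One input, two relu hidden units, one linear output (hand-picked weights)."""
    h1 = relu(1.0 * x - 1.0)   # switches on for x > 1
    h2 = relu(1.0 * x - 3.0)   # switches on for x > 3
    return 2.0 - 1.0 * h1 + 0.5 * h2


def slope(f, a, b):
    """Average slope of f between a and b."""
    return (f(b) - f(a)) / (b - a)


# The output is a straight line within each region, but the slope changes
# every time a hidden unit switches on.
print(slope(tiny_network, 0.0, 1.0))  # 0.0
print(slope(tiny_network, 2.0, 3.0))  # -1.0
print(slope(tiny_network, 4.0, 5.0))  # -0.5
```

Without the relu calls the same expression would collapse into a single straight line, which is all a purely linear model can represent.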

  1. We focus on the relationship between feature Displacement and target MPG.

    To do so, create a new dataset of the same size as train_features with all other features set to their median values; then vary Displacement between 0 and 500.

  1. Create a plotting function to:

    a) visualize the real Displacement and MPG values from the training dataset in a scatter plot

    b) overlay the predicted MPG from Displacement varying from 0 to 500, but holding all other features constant.

  1. Visualize predicted MPG using the linear model.
  1. Visualize predicted MPG using the neural network model. Do you see an improvement/non-linearity compared with the linear model?

    Yes, the DNN prediction curve follows the non-linear trend of the data.
  1. What are the other activation functions? Check the list of activations.

    Optional. Modify the DNN model with a different activation function, and fit it on the data; does it perform better?

Trying with Tanh

  1. Overfitting is a common problem for DNN models. How should we deal with it? Check Regularizers on tf.keras. Are there any other techniques that were invented for neural networks?
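As a starting point for the regularization question, here is a plain-Python sketch of an L2 weight penalty, which adds strength * sum(w^2) to the data loss; this matches the formula documented for tf.keras.regularizers.L2, but the function names, weights, and strength below are hypothetical.

```python
def l2_penalty(weights, strength):
    """L2 penalty: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, strength=0.01):
    """Total loss the optimizer would minimize under L2 regularization."""
    return data_loss + l2_penalty(weights, strength)


# Two hypothetical weight vectors that fit the data equally well (same MAE).
data_loss = 2.0
small_weights = [0.5, -0.5]
large_weights = [5.0, -5.0]

loss_small = regularized_loss(data_loss, small_weights)
loss_large = regularized_loss(data_loss, large_weights)

print(round(loss_small, 3))  # 2.005
print(round(loss_large, 3))  # 2.5
```

With equal data loss, the penalty makes the optimizer prefer the smaller weights, which tends to produce smoother functions that are less able to memorize noise.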

Task 7 - MLflow Tracking

In this task, we briefly explore MLflow Tracking, one of four primary functions that MLflow offers for managing the end-to-end machine learning lifecycle. We will access run information programmatically in Python and then set up the MLflow UI for easy interaction.

  1. Experiments.

    MLflow Tracking is organized around the concept of runs, which are executions of some piece of modeling code; and runs are organized into experiments.

    We enabled autologging at the beginning, so we can verify that

    • there is one experiment
    • its ID is 0
    • all of its artifacts are stored at file:///content/mlruns/0 on the Colab runtime's local disk.
  1. Runs.

    List information for runs that are under experiment '0' using mlflow.list_run_infos().

  1. Retrieve the most recently active run, i.e., the DNN model. Hint: mlflow.last_active_run()
  1. Use function print_auto_logged_info provided below to fetch the auto logged parameters and metrics for autolog_run.
  1. Optional. Retrieve the best run using MlflowClient().search_runs().
  1. To see what's logged in the file system under /content/mlruns/, click the Files tab in the left side panel in Colab. For example,

     mlruns
     └── 0
         ├── 3a5aebdd35ef46fb8dc35b40e542f0a4
         │   ├── artifacts
         │   ├── meta.yaml
         │   ├── metrics
         │   ├── params
         │   └── tags
         ├── c627bc526c4a4c418a8285627e61a16d
         │   ├── artifacts
         │   ├── meta.yaml
         │   ├── metrics
         │   ├── params
         │   └── tags
         └── meta.yaml
    
     11 directories, 3 files

    Inspect the model summary of the DNN model you ran previously; it is located at artifacts/model_summary.txt of the corresponding run. Use cat $filepath.
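Instead of typing the path by hand, the summary files can also be located by walking the directory layout shown above. The sketch below builds a throwaway copy of that layout in a temporary directory (the run ID and file content are placeholders) and then globs for artifacts/model_summary.txt; the helper name is hypothetical.

```python
import tempfile
from pathlib import Path


def find_model_summaries(mlruns_dir, experiment_id="0"):
    """Return every model_summary.txt stored under one experiment's runs."""
    experiment_dir = Path(mlruns_dir) / experiment_id
    return sorted(experiment_dir.glob("*/artifacts/model_summary.txt"))


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the mlruns/<experiment>/<run>/artifacts layout with fake data.
    run_dir = Path(tmp) / "mlruns" / "0" / "hypothetical_run_id"
    (run_dir / "artifacts").mkdir(parents=True)
    (run_dir / "artifacts" / "model_summary.txt").write_text("Model: sequential\n")

    found = find_model_summaries(Path(tmp) / "mlruns")
    relative_paths = [p.relative_to(tmp).as_posix() for p in found]
    summaries = [p.read_text() for p in found]

print(relative_paths)
print(summaries[0])
```

In Colab the same helper would be pointed at /content/mlruns, assuming the runs were logged to the default local location.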

It should show the following (taken from Colab):

The file browser shows that MLflow stored all the experiments in mlruns.

Showing what is in dnn

  1. Tracking UI.

    MLflow provides a UI for us to visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools.

    If your runs are logged to a local mlruns directory, run mlflow ui in the directory above it to load the corresponding runs.

    Running a localhost server in Colab, however, requires a bit of extra work:

    NOTE. NEVER share your secrets. Best to keep NGROK_AUTH_TOKEN as an environment variable and retrieve it via os.environ.get("NGROK_AUTH_TOKEN").

  1. Interact with Tracking UI.

    Open the link output by the previous cell and get oriented with Parameters, Metrics, Artifacts, and so on.

    When you are done, make sure to terminate the open tunnel:

Task 8 - AutoML with TPOT 🫖

  1. Instantiate and train a TPOT auto-ML regressor.

    The parameters are set fairly arbitrarily (if time permits, you should experiment with different sets of parameters after reading what each parameter does). Use these parameter values:

    generations: 10

    population_size: 40

    scoring: negative mean absolute error; read more in scoring functions in TPOT

    verbosity: 2 (so you can see each generation's performance)

    The final line will create a Python script tpot_mpg_pipeline.py with the code to create the optimal model found by TPOT.

  1. Examine the model pipeline that the TPOT regressor offers. If you see any model, function, or class that is not familiar, look it up!

    Note: There is randomness in the way TPOT searches, so it's possible you won't have exactly the same result as your classmates.

  1. Take the appropriate lines (e.g., updating the path to the data and the variable names) from tpot_mpg_pipeline.py to build a model on our training set and make predictions on the test set. Save the predictions as y_pred, and compute an appropriate evaluation metric. You may find that for this simple data set, the neural network we built outperforms the tree-based model, yet note that this is not a conclusion we can generalize to all tabular data.
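The "negative mean absolute error" scoring used in this task follows the scikit-learn convention that a higher score is always better, so errors are negated before candidates are compared. A small sketch with invented predictions for two hypothetical pipelines:

```python
def neg_mean_absolute_error(y_true, y_pred):
    """MAE with the sign flipped, so that larger values mean better models."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return -mae


y_test = [20.0, 25.0, 30.0]

# Hypothetical predictions from two candidate pipelines.
candidate_predictions = {
    "pipeline_a": [22.0, 27.0, 28.0],   # off by 2 everywhere
    "pipeline_b": [20.5, 24.5, 30.5],   # off by 0.5 everywhere
}

scores = {
    name: neg_mean_absolute_error(y_test, y_pred)
    for name, y_pred in candidate_predictions.items()
}

# Selecting the maximum score picks the pipeline with the smallest error.
best_pipeline = max(scores, key=scores.get)

print(scores)         # {'pipeline_a': -2.0, 'pipeline_b': -0.5}
print(best_pipeline)  # pipeline_b
```

When reporting the result for y_pred, flip the sign back so that it can be compared directly with the MAE of the linear and DNN models.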

Task 9 - Model Explainability

Last week, we introduced model explainability with SHAP, and we will continue to incorporate it as part of our model output this week. You can use the Kernel Explainer to explain both the neural network and the TPOT regressor.
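As a reminder of what the explainer estimates, the sketch below computes exact Shapley values for a toy three-feature model by averaging each feature's marginal contribution over all feature orderings. It is not the shap library's implementation: Kernel SHAP approximates these values by sampling against a background dataset, and the model, instance, and baseline here are all invented.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every feature ordering."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start from the baseline input
        previous_output = model(current)
        for i in order:
            current[i] = instance[i]      # reveal feature i
            output = model(current)
            totals[i] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]


def toy_model(x):
    """Hypothetical MPG model: a linear function of three features."""
    return 30.0 - 2.0 * x[0] - 0.5 * x[1] + 1.0 * x[2]


instance = [8.0, 4.0, 1.0]   # the car being explained
baseline = [4.0, 2.0, 0.0]   # a reference car

values = shapley_values(toy_model, instance, baseline)
print(values)  # [-8.0, -1.0, 1.0]
print(sum(values), toy_model(instance) - toy_model(baseline))  # -8.0 -8.0
```

The values add up to the difference between the prediction and the baseline prediction, which is the property that makes SHAP plots readable as a breakdown of a single prediction.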

Task 10 - Taking it to the Next Level! 📶

Let's take our models and make a model comparison demo like we did last week, but this time you're taking the lead!

  1. Save your training dataset as a CSV file so that it can be used in the Streamlit app.
  2. Build a results DataFrame and save it as a CSV so that it can be used in the Streamlit app.
  3. In Tab 1 - Raw Data:
  4. In Tab 2 - Model Results:
  5. In Tab 3 - Model Explainability:

Additional Resources

Acknowledgement and Copyright

Acknowledgement

This notebook is adapted from the TensorFlow/Keras tutorial on regression.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

MIT License

Copyright (c) 2017 François Chollet

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.